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ABSTRACT 

The problem of Cholesky factorization of a sparse matrix has been very well investigated 
on sequential machines. A number of efficient codes exist for factorizing large unstructured 
sparse matrices, for example, codes from Harwell Subroutine Library [4] and Sparspak [7]. 
However, there is a lack of such efficient codes on parallel machines in general, and distributed 
memory machines in particular. Some of the issues which are critical to the implementation 
of sparse Cholesky factorization on a distributed memory parallel machine are: ordering , 
partitioning and mapping , load balancing , and ordering of various tasks within a processor. 
Addressing these issues optimally for unstructured sparse matrices is a challenging task. 
In this paper we focus on the effect of various partitioning schemes on the performance of 
sparse Cholesky factorization on the INTEL iPSC/860. We also propose a new partitioning 
heuristic for structured as well as unstructured sparse matrices, and compare its performance 
with the other schemes. 


1 Research was supported in part by the National Aeronautics and Space Administration under NASA 
Contract No. NAS 1-18605 while the author was in residence at the Institute for Computer Applications in 
Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665. 
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1 Introduction 


The problem of Cholesky factorization of a sparse matrix has been very well investigated 
on sequential machines. A number of efficient codes exist for factorizing large unstructured 
sparse matrices, for example, codes from Harwell Subroutine Library [4] and Sparspak [7]. 
However, there is a lack of such efficient codes on parallel machines in general, and distributed 
memory machines in particular. This is partly because these machines are relatively new 
and there is not much experience to solve unstructured problems on these machines. How- 
ever, there has been reasonable success in putting unstructured Euler codes on a distributed 
memory parallel machine [3]. (For these codes, in contrast to unstructured sparse factoriza- 
tion codes, ordering of various tasks such as computation and communication, within each 
processor is not an issue.) 

Some of the issues which are critical to the implementation of sparse Cholesky factor- 
ization on a distributed memory parallel machine are: ordering , partitioning and mapping , 
load balancing, and ordering of various tasks within a processor. Addressing these issues 
optimally is a challenging task. For example, it is not clear what is a good ordering scheme 
for parallel factorization. (Recently, reordering schemes for parallel factorization have been 
suggested in the literature, for example [12]. But, we are not aware of any performance fig- 
ures for these orderings on an actual distributed memory parallel machine. ) The problem 
becomes more complex because a solution obtained at a step may influence the solution at 
the next step. For example, an ordering which minimizes fill may lead to an unbalanced 
load. 

In the past, some attempts have been made to implement the Cholesky factorization 
for structured sparse matrices on the INTEL iPSC/2. George et al. have described an im- 
plementation of the fan-out algorithm [9]. Recently, Ashcraft et al. [1] have presented a 
compute-ahead implementation of fan-in the algorithm. Although, relative performance of 
their implementation is better than the basic fan-in and fan-out implementation, the abso- 
lute performance is far from desirable. For example, the factorization time reported for a 
75 x 75 grid problem using compute ahead fan-in is 1.561 seconds on a 64-processor machine. 
In megaflops this is approximately 0.075 mflops per node. This is significantly low when one 
considers the performance of the existing sequential codes on RISC based workstations. The 
MA27 code on IBM RS/6000, 41 MHz machine, gives 11 MFlops for medium sized matrices 
[15]. Ashcraft et al. [2] have compared the communication requirement of distributed mul- 
tifrontal schemes with the fan-out and fan-in schemes. It should be noted that most of the 
above mentioned studies have been done for structured sparse matrices arising from regular 
grids. For these problems, ordering, partitioning, and load balancing do not pose a significant 
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problem. Nested dissection is used for ordering these problems as it gives optimal-order fill 
and well-balanced elimination trees [7]. The partitioning scheme suggested in [8] for regular 
grid problems results in good load balancing and low communication cost. For unstructured 
sparse matrices not many results have been reported. 

Venugopal and Naik have recently studied partitioning and scheduling methodology for 
unstructured sparse matrix factorization on distributed memory machines [13]. However, 
they do not report any performance results of their studies on an actual machine. 

We feel that there is much more to be done before one can possibly get good performance 
for sparse Cholesky factorization on a distributed memory parallel machine. Our effort is a 
step in that direction. In this paper, we study the effect of various partitioning schemes on 
the performance of sparse Cholesky factorization on the INTEL iPSC/860. We also propose 
a new partitioning heuristic for structured as well as unstructured sparse matrices, and com- 
pare its performance with the other schemes. The distributed factorization algorithm which 
has been implemented is a variation of the distributed fan-out algorithm. The distributed 
fan-out algorithm is known to have greater interprocessor communication costs than the 
other distributed algorithms [11]. We still selected this algorithm because (i) it is simple to 
implement, and (ii) the focus of our research was to study the effect of partitioning on the 
performance of factorization. 

The rest of the paper is organized as follows. In the next section we review the basic 
Cholesky algorithm for solving a linear system of equations. Section 3 gives a brief de- 
scription of various partitioning schemes including the proposed heuristic. In Section 4 we 
briefly describe the implementation of factorization algorithm. The experimental results are 
discussed in Section 5. Finally in Section 6 we give the conclusions. 

2 Background 

Consider a system of linear equations, 

/ 

Ax = b, 

where A is an n x n symmetric positive definite matrix, b is a known vector and x is the 
unknown vector to be computed. One way to solve such linear systems is to compute the 
Cholesky factorization of matrix A , 

A = LL t , 

where L is a lower triangular matrix. Then x is computed by solving the triangular systems 

Ly = b, and 
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L t x = y. 

We now briefly discuss the basic steps involved in the solution of such a system on a dis- 
tributed memory parallel machine. 

(i) Ordering : Find an ordering P of the sparse matrix A so that the Cholesky factor L 
of PAP t suffers little fill and at the same time reduces the parallel time. 

(ii) Preprocessing: It consists of three parts. The first part is to determine the structure of 
L. The second part is to obtain the partitioning and mapping, that is the distribution 
of the columns of A amongst various processors of the machine. The third part is 
to create the required data structures for each processor for the numeric factorization 
step. 

(iii) Numeric Factorization: Compute the Cholesky factor L of PAP T . 

(iv) Triangular Solution: Solve Ly = Pb and L T z = y , and then set x = P T z. 

In this paper, we have focussed on the effect of partitioning on the performance of nu- 
meric factorization. We discuss them in a little more detail. 

Partitioning. Given a graph G of n nodes associated with the L -f L T matrix, find p 
(n » p) partitions of the graph with (i) large number of intra-partition edges and very few 
inter-partition edges, and (ii) nearly equal computational load for each partition. 

Numeric Factorization. The distributed algorithms proposed in the literature are based 
on the column-oriented Cholesky factorization. Following [8, 14], the basic column-oriented 
algorithm can be expressed as 

begin 

for j = 1 to n do 
begin 

for k = 1 to j-1 do 
cmod(j,k) 
cdiv(j) 

end 

end 
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where, 

cmod(j,k) : is modification of column j by column k (k < j ), and is also referred as the 
update computation. 

cdiv(j) : is division of column j by a scalar, and is also referred as the factorize computation. 

The three basic distributed algorithms reported in the literature are distributed fan-in, 
distributed fan-out and distributed multifrontal. We do not discuss these algorithms here. ' 
For a detailed discussion of these algorithms along with their advantages and disadvantages, 
one may refer to [11]. For this paper, we have implemented a variation of the distributed 
fan-out algorithm. 

3 Partitioning Schemes 

In this section we first briefly discuss some of the existing partitioning schemes which have 
been investigated in this paper, and finally we describe a new partitioning heuristic for 
structured as well as unstructured sparse matrices. 

3.1 Subtree-to-Subcube 

This scheme was suggested by George et al. [8] for regular grid problems. An example 
illustrating the partitioning and mapping of a 7 x 7 grid is shown in Figure 1. For details 
one can refer to [8]. 

3.2 Contiguous column 

In this scheme contiguous columns of a sparse matrix are assigned to a partition such that 
there is a uniform distribution of columns amongst various partitions. 

3.3 Contiguous column with uniform operation count 

In this scheme contiguous columns of a sparse matrix are assigned to a partition such that 
the number of operations required to factorize columns in a partition is nearly equal for all 
the partitions. 

3.4 Wrap-around 

In this scheme zth column of a sparse matrix is assigned to ( i — 1) mod p partition, where p 
is the total number of partitions. 
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3.5 Heuristic 

We first describe how to form a single partition using this scheme. Consider the Cholesky 
factorization of a given sparse symmetric positive definite matrix A into LL T . Assume 
that the matrix A has already been ordered by some permutation. Let G(F) be the graph 
associated with F — L - f- XT. In the discussion here and the rest of the paper also, we use 
the term node of a graph G and a column of sparse matrix A interchangeably. Pick a node 
(column) of G, which does not depend on any other nodes (columns) for factorization, as 
the first node of the partition. We now do a breadth first search on G starting with this 
node. At any level of the breadth first search we have a set of visited nodes. Out of this set 
we select only those nodes which satisfy some criterion. The rest of the nodes are rejected, 
and the breadth first search is continued from the selected set of nodes. This process can be 
viewed as pruning of the breadth first search tree. We stop when either the computation 
load corresponding to the partition reaches a fixed threshold, or there are no more unvisited 
nodes. 

Before forming the next partition, we mark all the nodes which were selected for the pre- 
vious partition as visited. Note that all the nodes which were rejected are marked unvisited. 
In case a starting node (which does not depend on any other nodes for factorization) is not 
found for the new partition, an arbitrary node from the unvisited nodes is selected. It is 
possible that after forming all the partitions there are some free nodes, that is nodes which 
have not been included in any partition. These nodes are distributed such that a free node 
is assigned to a partition which has most of its neighbors. 

Selection criterion. A node is included in the partition if most of its neighbors have already 
been included in the partition. The selection criterion is made stronger with the addition of 
more nodes in the partition. Consequently, at later stages of the partition formation, fewei 
and fewer nodes from a visited set of nodes are selected. Concretely, a node i is included in 
the partition if 

indeg(i) opc ^ 

outdeg{i ) mop 

where, 

indeg(i).'. for a given partition it is the number of nodes within the partition that are adjacent 
to node i, 

outdeg(i).: for a given partition it is the number of nodes outside the partition that are 
adjacent to node i, 

opc.: is the number of flops associated with the current partition, 

mop.: is the number of flops needed to factorize the complete sparse matrix divided by the 
number of partitions, and 
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a: is a tunable parameter with value greater than equal to zero. 

Note that for a — 0 there is no pruning of the breadth first search tree. We illustrate the 
partitioning heuristic with the help of an example. Consider the graph G(F ) of Figure 2a. 
We indicate the operation count associated with a node i by c(i). The total operation count 
for the example of Figure 2a is 75. Let us assume that we are interested in forming two 
partitions. Thus the value of mop in Eq.(l) is 38. The partition formation by the heuristic 
scheme with a = 0 is illustrated in Figure 2b. During the formation of the first partition 
we pick node 1 as the starting node. Note that node 1 does not depend on any other node 
for factorization. The operation count associated with this node is 3. Thus the value of ope 
(which is the number of floating point operations corresponding to nodes currently in the 
partition) is initialized to 3. At the next level we select nodes 4 and 7 (See Eq. (1)). The 
value of opc at this point becomes 24. In the next level, we first select node 5 which makes 
opc = 33, and then node 6 is selected which takes the value of opc to 44. Since the value of 
opc at this point is greater than the value of mop , the formation of first partition is stopped. 
Similarly the second partition is formed starting with node 2. 

4 Implementation 

Our factorization scheme can be considered as a compute-ahead implementation of the fan- 
out algorithm. In our algorithm description, which follows next, we make a distinction 
between two types of computations for a typical column at a processor. The first is the 
update computation which is the modification of a column by other columns, and the other 
is the factorize computation which is the division of a column by a scalar. The factorize 
computation is done on a column when it is completely updated by all the required columns. 
The complete algorithm can be best explained informally by considering a ready queue at 
each processor. The ready queue is initialized with the column numbers which do not require 
any update, and are ready for factorization. The program at a node can be described as 
follows. 

Algorithm: Distributed factorizaton 

10 while ready queue is not empty do 

• factorize the first ready column available in the queue. 

• send the factorized column to off-processors. 

• update the required local columns by the factorized column. 

• update the ready queue, that is a column which has been 

completely updated is inserted in the ready queue. 

end while 
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if all local columns have been factorized and sent then stop, 
if a column is received then 

• update the dependent local columns. 

• update the ready queue. 

end if 
go to 10. 


The detailed code for the implementation can be found in [16]. 

5 Experimental Results 

We did a series of experiments to evaluate the performance of various partitioning schemes on 
INTEL iPSC/860. The experiments were done on both structured and unstructured sparse 
matrices. Table 1 lists three structured matrices arising from nine-point finite difference 
operators on square grids. In the first set of experiments, we evaluated the performance 
of five different partitioning schemes for these matrices. The five partitioning schemes are: 
(i) wrap around ( wr ), (ii) contiguous column with uniform distribution of columns (cel), 
(iii) contiguous column with uniform distribution of operation counts (cc2), (iv) subtree- 
to-subcube (ss), and (v) heuristic (hr). For all the schemes, except subtree-to-subcube, we 
used the minimum degree ordering given in sparspak [7] for ordering the sparse matrices. For 
subtree-to-subcube partitioning the matrices were ordered using nested dissection ordering 
[7]. Table 2 summarizes the total factorization time in seconds (tt) for different partitioning 
schemes for three types of matrices with varying number of processors. It is clear from the 
table that the performance of wr and ccl is not comparable with the other three schemes. 
This can be explained by further examining the performance of a 75 x 75 grid problem on a 
4-processor machine. We observed the distribution of operation count, computation time (the 
time spent on computation on a processor), and total time on all the four processors. These 
observations are tabulated in Tables 3a, 3b and 3c. It is interesting to observe from these 
tables that the reason for the bad performance of wr is different from the one for ccl. The 
wr scheme results in uniform distribution of operation counts, but the time spent in com- 
munication is relatively greater. On the other hand, the ccl scheme results in non-uniform 
distribution of operation counts which is mainly responsible for the its bad performance. 

Another observation can be made from Table 2, that is, as the number of processors 
is increased the performance of cc2 and hr becomes better than that of ss. The worse 
performance of ss could also be due to the use of distributed fan-out algorithm for numeric 
factorization. The distributed fan-out algorithm is known for not exploiting the subtree-to- 
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subcube mapping effectively [11]. We also give the performance of various schemes in mflops 
on a 16 processor machine (see Table 4). For 100 x 100 grid problem we obtained 9.65 mflops 
which is around 0.6 mflops per processor. 

The second set of experiments was done on unstructured matrices from Harwell Boeing 
Collection [5] (see Table 5). For these matrices we compare the performance of cc2 with hr. 
The results are summarized in Table 6. For each scheme, we have listed the computation 
time ( ct ), and the total time (tt). We have also listed the performance in mflops for the two 
schemes. We observe that the performance of hr is better than that of cc2. To understand 
this behavior, we observed the distribution of operation counts, computation time, and total 
time over the processors of the machine. We summarize these results in Table 7. For both 
the schemes we have listed the standard deviation in, 

(i) operation count as a fraction of average operation count on a processor ( sdopc ), 

(ii) computation time as a fraction of average computation time on a processor ( sdct ), and 

(iii) total time as a fraction of average total time on a processor ( sdtt ). 

It is obvious from this table that both the schemes result in balanced computation and 
communication. The performance of cc2 is bad because it results in large volume of commu- 
nication traffic as compared to the hr scheme. 

6 Conclusion 

In this paper, we have studied the effect of various partitioning schemes on the performance 
of sparse factorization on INTEL iPSC/860 for structured as well as unstructured sparse 
matrices. We show that the proposed partitioning heuristic works for both structured and 
unstructured sparse matrices. The absolute performance of the factorization step was not 
that impressive. We believe it can be improved by implementing a distributed algorithm 
which (i) maximize the performance at each processor by exploiting the cache behavior, and 
(ii) orders the computations at each processor to minimize the communication overheads. 
One such algorithm, in our opinion, is a distributed multifrontal scheme. 
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Table 1 . List of structured matrices arising from nine-point finite difference operators 
square grid. 


grid problem 

order 

nonzeros 

operation count 

50 x 50 

2500 

12202 

2032374 

75 x 75 

5625 

27677 

7227520 

100 x 100 

10000 

49402 

17562662 


Table 2. Performance of various partitioning schemes for structured matrices. 


grid problem 

np 

tt (sec' 


wr 

ccl 

cc2 

ss 

hr 

50 x 50 

2 

1.28 

0.95 

0.92 

0.92 

0.94 


4 

1.33 

0.82 

0.67 

0.73 

0.66 


8 

1.65 

0.82 

0.44 

0.70 

0.44 


16 

1.87 

0.84 

0.29 

0.64 

0.36 

75 x 75 

2 

4.23 

3.04 

2.88 

2.88 

3.36 


4 

4.81 

2.51 

2.13 

1.92 

2.04 


8 

5.36 

2.45 

1.29 

1.58 

1.29 


16 

7.31 

2.52 

0.82 

1.52 

0.96 

100 x 100 

2 

7.96 

7.31 

6.89 

6.87 

8.22 


4 

5.44 

5.97 

4.97 

4.43 

4.91 


8 

4.21 

5.72 

2.97 

3.45 

2.83 


16 

3.53 

8.26 

1.85 

3.28 

1.82 


on a 
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Table 3a. Distribution of operation counts for a 75 x 75 grid problem on a 4-processor 
machine. 


part, scheme 

pO 

pl 

p2 

p3 

wr 

1804438 

1804649 

1813194 

1805239 

ccl 

1266845 

1856797 

1266953 

2836925 

cc2 

1777056 

1803289 

1784693 

1862482 

ss 

1830051 

1805154 

1814518 

1777797 

hr 

1820510 

1818536 

1807575 

1780899 


Table 3b. Distribution of computation time (in sec) for a 75 x 75 grid problem on a 4- 
processor machine. 


part, scheme 

P o 

pl 

P 2 

p3 

wr 

1.61 

1.80 

1.59 

1.76 

ccl 

0.84 

1.30 

0.84 

2.38 

cc2 

1.19 

1.38 

1.24 

1.40 

ss 

1.66 

1.65 

1.67 

1.66 

hr 

1.32 

1.36 

1.34 

1.22 


Table 3c. Distribution of total time (in sec) for a 75 x 75 grid problem on a 4-processor 
machine. 


part, scheme 

p0 

Pl 

P 2 

p3 

wr 

4.81 

4.81 

4.81 

4.81 

ccl 

0.91 

1.40 

0.91 

2.51 

cc2 

1.33 

1.61 

1.38 

2.13 

ss 

1.92 

1.92 

1.92 

1.92 

hr 

1.99 

2.03 

2.04 

1.51 
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Table 4. Performance in mflops of various partitioning schemes for a 16-processor machine 


grid problem 

mflops 

wr 

ccl 

cc2 

ss 

hr 

50 x 50 

1.09 

2.42 

7.00 

3.18 

5.65 

75 x 75 

0.99 

2.87 

8.81 

4.75 

7.53 

100 x 100 

4.98 

2.13 

9.49 



5.35 

9.65 


Table 5. List of Harwell Boeing test matrices. 


matrix 

order 

nonzeros 

operation count 

bcsstkl6 

4884 I 

147631 

184196735 

bcsstkl7 

10974 

219812 

214447177 

bcsstkl8 

11948 

80519 

162705482 

bcsstk28 

4410 

111717 

40562546 


Table 6. Performance of Harwell Boeing test matrices. 


matrix 

cc2 

hr 

mflops 

ct 

tt 

ct 

tt 

cc2 

hr 

bcsstkl6 

8.77 

41.15 

7.88 

32.07 

4.48 

5.74 

bcsstkl7 

11.07 

37.53 

10.20 

26.95 

5.71 

7.96 

bcsstkl8 

9.30 

26.47 

7.18 

21.97 

6.15 

7.41 

bcsstk28 

2.21 

9.00 

1.75 

5.97 

4.5 

6.79 


Table 7. Load distribution of Harwell Boeing test matrices. 


matrix 

sdopc 

sdct 

sdtt 

cc2 

hr 

cc2 

hr 

cc2 

hr 

bcsstkl6 

0.01 

0.01 

0.07 

0.05 

0.37 

0.42 

bcsstkl7 

0.01 

0.02 

0.07 

0.04 

0.32 

0.31 

bcsstkl8 

0.01 

0.03 

0.09 

0.03 

0.22 

0.10 

bcsstk28 

0.02 

0.03 

0.07 

0.04 

0.35 

0.23 
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Ordering Partitioning 


Figure 1. Nested dissection ordering and subtree-to-subcube partitioning on 7 x 7 grid. 
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Figure 2a. The graph G(F). (c(i) is the operation count associated with node i) 



Formation of first partition Formation of second partition 


Figure 2b. Various stages of partition formation. 

(opc gives the current operation count of a partition) 
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